In Spring 2021, I taught a course entitled “Telling Stories with Data” (TSD), which introduced non-STEM majors to the Tidyverse and basic data visualization and analysis. I had bright students, but ones who typically had NO prior experience with either statistical analysis or computer programming. So TSD was designed as a soft-entry, beginner-level guide to working with data. For our stories, we started with various data sets available in R packages – the usual suspects – and then progressed to data sets in the wild.
Our capstone project required the students to use two data sets from the Gapminder.org foundation to build a dashboard and tell a coherent story with visualizations, summary stats, text contextualization and analyses, and at least one basic model or hypothesis test. (Three capstone project examples: Bonnie, Chanley, and Bethia).
library(tidyverse)
library(here)
library(visdat)
For their capstone projects, nearly all the students wanted to include Choropleth maps. Understandable. The students were working with global data, and Choropleth maps both look impressive and are useful. But we ran into some problems.
Gapminder.org uses the nation-state as a primary unit of analysis: the country variable in their data sets. But they do NOT include the standard ISO country codes. The R ecosystem has various mapping tools, some well outside the Tidyverse, and all of them require remapping at least some of the Gapminder country names to the geo-data units, or vice versa. So until we have all the primary units remapped, we get something like this:
load(here::here("data", "tidy_data", "cmap.rda") )
bad_ex
To simplify this as the course required, and to stay largely within the Tidyverse ecosystem so as to avoid cognitive overload, we went with the world map from ggplot2::map_data("world"). To ensure compatibility with the Gapminder.org data, we created a new data set with the geo-mapping information: world_map2. We fixed most of the flaws, and did “good enough” quick and dirty Choropleth maps – but I wanted to finish what we started.
If data was missing for a given nation for a given year, we wanted to know that. We also wanted our mapping data compatible across all the Gapminder.org data sets. We wanted a result more like this:
plotly::ggplotly(good_ex)
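One way to keep nations with missing data visible is to join the indicator onto the map polygons, rather than the other way round, and give `NA` its own fill color. The following is a minimal sketch under stated assumptions, not the code behind `good_ex`: the two "countries", their coordinates, and the `life_exp` value are all made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy polygons: two unit squares standing in for two countries (hypothetical data)
toy_map <- tibble::tibble(
  country = rep(c("Country A", "Country B"), each = 4),
  group   = rep(c(1, 2), each = 4),
  order   = 1:8,
  long    = c(0, 1, 1, 0, 2, 3, 3, 2),
  lat     = c(0, 0, 1, 1, 0, 0, 1, 1)
)

# Toy indicator: only one of the two countries has a value
toy_data <- tibble::tibble(country = "Country A", life_exp = 71.2)

# Keep every polygon; countries without data get NA rather than vanishing
toy_joined <- toy_map %>%
  left_join(toy_data, by = "country")

missing_ex <- ggplot(toy_joined,
                     aes(x = long, y = lat, group = group, fill = life_exp)) +
  geom_polygon(color = "white") +
  scale_fill_viridis_c(na.value = "grey80") +  # NA countries drawn in grey
  coord_quickmap()
```

Because the map table is on the left of the join, a country with no data still has its polygon rows, and the grey fill flags the gap instead of leaving a hole in the map.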
We have two core problems with using country names as the primary unit for defining the mapping polygon. First, the names are not consistent across data sets. The informal names of the nations can vary greatly; the formal names are often too long for appropriate labeling and generally not even recorded. In the Gapminder.org data sets, which largely share a common source, for Sint Maarten, we have two values: “Sint Maarten” and “Sint Maarten (Dutch part)”. This is because the same island also contains the Collectivity of Saint Martin, more commonly known as the French “Saint Martin”. When we move from the Gapminder.org data sets to others, the country name values can vary greatly. In the map data that ships with ggplot2, the preferred “Eswatini” appears as the older “Swaziland”; “North Macedonia” (the name since 2019) as the older “Macedonia”; and so on.
The obvious solution to this problem of inconsistent country nomenclature: use the ISO 3166-1 codes, whether the three-letter designation (alpha-3), the three-digit numeric code (the same numbers as the UN M49 standard), or both. Neither the Gapminder.org data sets nor the map data shipped with ggplot2 includes them.
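To make the point concrete, here is a small sketch of why a shared code column removes the name-matching problem. The two tables and the `value` column are hypothetical; only the names and ISO alpha-3 codes (SWZ, MKD, LAO) are real.

```r
library(dplyr)

# Hypothetical map-side table: older names, plus ISO alpha-3 codes
map_units <- tibble::tibble(
  map_name = c("Swaziland", "Macedonia", "Laos"),
  iso3     = c("SWZ", "MKD", "LAO")
)

# Hypothetical data-side table: newer or variant names, same codes
indicator <- tibble::tibble(
  data_name = c("Eswatini", "North Macedonia", "Lao"),
  iso3      = c("SWZ", "MKD", "LAO"),
  value     = c(1, 2, 3)   # made-up values for illustration only
)

# Joining on names matches nothing; joining on the code matches everything
by_name <- inner_join(map_units, indicator, by = c("map_name" = "data_name"))
by_code <- inner_join(map_units, indicator, by = "iso3")

nrow(by_name)
nrow(by_code)
```

With codes on both sides, the display name can be whatever suits the labels, and the join key never has to change when a country is renamed.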
Second, in practical terms, we have no simple definition of what comprises a country. As of 4 September 2020, Kosovo was recognized by 97 out of 193 (50.26%) United Nations member states; as of July 2021, Western Sahara was recognized by 45 out of a total of 193 United Nations member states. Likewise, we also have existing designations that do not distinguish clearly between geographical boundaries and political boundaries. Some of the Gapminder data sets, for example, report on the “Channel Islands”: more properly, the two Crown dependencies, the Bailiwick of Jersey, and the Bailiwick of Guernsey. But as Wikipedia correctly reports: “‘Channel Islands’ is a geographical term, not a political unit. The two bailiwicks have been administered separately since the late 13th century. Each has its own independent laws, elections, and representative bodies…. Any institution common to both is the exception rather than the rule.” Jersey, for example, is “a self-governing parliamentary democracy under a constitutional monarchy, with its own financial, legal and judicial systems, and the power of self-determination”: for mapping purposes, it also has its own ISO codes. In truth, it makes more sense NOT to lump Jersey and Guernsey together for the purposes of economic, social, and public health data analysis, even if Gapminder.org and/or the World Bank did so for some data collections.
Finally, on this point, some of the Gapminder data sets also include as a country value the dissolved Netherlands Antilles. If we keep this historical designation, which is needed for only a limited number of data analyses, we must otherwise ignore the data for the now separate constituent countries of Aruba and Curaçao. So although I want a mapping data set highly compatible with the Gapminder.org data sets, it should also work with any global studies data set. The value “Netherlands Antilles” will be dropped.
Although I am updating the world map from ggplot2::map_data("world") to generally work better with the Gapminder.org data sets, I am also including the ISO country code data so that name-matching becomes largely irrelevant when one is working with data sets in which the national and territorial entities have been properly designated. Regrettably, this seems more the exception than the rule.
The remainder of this document defines the process of creating world_map2, our new data set directly derived from ggplot2::map_data("world"). The data set world_map2, this RMD, and a supporting case study are available at github.com/Thom-J-H/map_Gap_2_Tidy. I include the steps and rationale below for full transparency, and in hopes that other people can improve upon this effort or offer a better solution for working with Global Studies data sets (like the Gapminder.org data) in the Tidyverse.
The Gapminder.org data sets available for download are generally sourced to the World Bank and available under a Creative Commons Attribution 4.0 International license. They cover global trends with the nation-state, the variable country, as a primary level of analysis. The data is also organized chronologically, by year.
Between data sets, the names for countries are generally consistent. Some sets do cover more nations (and territories and sub-national units).
We will use four sets below to test differences in coverage, and to build a country names reference.
# Data sets from Gapminder.org --------------------------------------------
life_expectancy_years <- read_csv(here::here("data",
"raw_data",
"life_expectancy_years.csv") ,
show_col_types = FALSE)
total_fertility <- read_csv(here::here("data",
"raw_data",
"children_per_woman_total_fertility.csv"),
show_col_types = FALSE)
energy_use_per_person <- read_csv(here::here("data",
"raw_data",
"energy_use_per_person.csv"),
show_col_types = FALSE)
demox_eiu <- read_csv(here::here("data",
"raw_data",
"demox_eiu.csv"),
show_col_types = FALSE)
The Gapminder.org data sets are untidy, and in wide format: one row per country, one column per year. We’ll deal with those issues later. Depending on the primary variable of interest, we have a different range of nations and years covered. For example, the data set for Life Expectancy (years) has 189 designated countries; the data set for Total Fertility, 202 countries; the data set for Energy Use per Capita, 169 nations; and the data set for Democracy Index (EIU), 166 nations. But Total Fertility, to take one comparison, does not simply have 13 more listed countries than Life Expectancy (years): we have meaningful set differences in coverage between the sets.
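For readers who want to see the eventual tidying step now, here is a minimal sketch, assuming the layout seen in life_expectancy_years (one row per country, one column per year). The toy tibble and the column name `life_expectancy` are illustrative choices, not names taken from the project.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for a Gapminder download: one row per country, one column per year
wide_ex <- tibble::tibble(
  country = c("Andorra", "Angola"),
  `1800`  = c(NA, 27),
  `1801`  = c(NA, 27)
)

# Gather the year columns into a year/value pair, with year stored as an integer
long_ex <- wide_ex %>%
  pivot_longer(cols = -country,
               names_to = "year",
               values_to = "life_expectancy",
               names_transform = list(year = as.integer))

long_ex
```

The result has one row per country and year, which is the shape needed for filtering to a single year before joining onto map data.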
Set Differences
life_expectancy_years %>% head()
## # A tibble: 6 x 302
## country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani~ 28.2 28.2 28.2 28.2 28.2 28.2 28.1 28.1 28.1 28.1
## 2 Angola 27 27 27 27 27 27 27 27 27 27
## 3 Albania 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4
## 4 Andorra NA NA NA NA NA NA NA NA NA NA
## 5 United ~ 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7
## 6 Argenti~ 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2
## # ... with 291 more variables: 1810 <dbl>, 1811 <dbl>, 1812 <dbl>, 1813 <dbl>,
## # 1814 <dbl>, 1815 <dbl>, 1816 <dbl>, 1817 <dbl>, 1818 <dbl>, 1819 <dbl>,
## # 1820 <dbl>, 1821 <dbl>, 1822 <dbl>, 1823 <dbl>, 1824 <dbl>, 1825 <dbl>,
## # 1826 <dbl>, 1827 <dbl>, 1828 <dbl>, 1829 <dbl>, 1830 <dbl>, 1831 <dbl>,
## # 1832 <dbl>, 1833 <dbl>, 1834 <dbl>, 1835 <dbl>, 1836 <dbl>, 1837 <dbl>,
## # 1838 <dbl>, 1839 <dbl>, 1840 <dbl>, 1841 <dbl>, 1842 <dbl>, 1843 <dbl>,
## # 1844 <dbl>, 1845 <dbl>, 1846 <dbl>, 1847 <dbl>, 1848 <dbl>, 1849 <dbl>, ...
demox_eiu %>% vis_dat()
life_expectancy_years %>%
arrange(country) %>%
select(country)
## # A tibble: 189 x 1
## country
## <chr>
## 1 Afghanistan
## 2 Albania
## 3 Algeria
## 4 Andorra
## 5 Angola
## 6 Antigua and Barbuda
## 7 Argentina
## 8 Armenia
## 9 Australia
## 10 Austria
## # ... with 179 more rows
demox_eiu %>%
arrange(country) %>%
select(country)
## # A tibble: 166 x 1
## country
## <chr>
## 1 Afghanistan
## 2 Albania
## 3 Algeria
## 4 Angola
## 5 Argentina
## 6 Armenia
## 7 Australia
## 8 Austria
## 9 Azerbaijan
## 10 Bahrain
## # ... with 156 more rows
life_expectancy_years$country %>%
n_distinct()
## [1] 189
energy_use_per_person$country %>%
n_distinct()
## [1] 169
Let’s explore the similarities and differences in coverage for the country variable between the sets.
### Fertility vs. Life coverage
setdiff(total_fertility$country, life_expectancy_years$country) %>%
knitr::kable(caption = "Fertility vs. Life coverage" ,
row.names = TRUE)
| | x |
|---|---|
| 1 | Aruba |
| 2 | Netherlands Antilles |
| 3 | Channel Islands |
| 4 | Western Sahara |
| 5 | Guadeloupe |
| 6 | Greenland |
| 7 | French Guiana |
| 8 | Guam |
| 9 | Macao, China |
| 10 | Martinique |
| 11 | Mayotte |
| 12 | New Caledonia |
| 13 | Puerto Rico |
| 14 | French Polynesia |
| 15 | Reunion |
| 16 | Virgin Islands (U.S.) |
### Life vs. Fertility coverage
setdiff(life_expectancy_years$country, total_fertility$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Life vs. Fertility coverage" ,
row.names = TRUE)
| | diff |
|---|---|
| 1 | Andorra |
| 2 | Dominica |
| 3 | Marshall Islands |
### Fertility vs. Energy coverage
setdiff(total_fertility$country, energy_use_per_person$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Fertility vs. Energy coverage",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Aruba |
| 2 | Afghanistan |
| 3 | Netherlands Antilles |
| 4 | Burundi |
| 5 | Burkina Faso |
| 6 | Central African Republic |
| 7 | Channel Islands |
| 8 | Western Sahara |
| 9 | Micronesia, Fed. Sts. |
| 10 | Guinea |
| 11 | Guadeloupe |
| 12 | Greenland |
| 13 | French Guiana |
| 14 | Guam |
| 15 | Hong Kong, China |
| 16 | Lao |
| 17 | Liberia |
| 18 | Macao, China |
| 19 | Madagascar |
| 20 | Mali |
| 21 | Mauritania |
| 22 | Martinique |
| 23 | Malawi |
| 24 | Mayotte |
| 25 | New Caledonia |
| 26 | Papua New Guinea |
| 27 | Puerto Rico |
| 28 | Palestine |
| 29 | French Polynesia |
| 30 | Reunion |
| 31 | Rwanda |
| 32 | Sierra Leone |
| 33 | Somalia |
| 34 | Chad |
| 35 | Taiwan |
| 36 | Uganda |
| 37 | Virgin Islands (U.S.) |
### Energy vs. Fertility coverage
setdiff(energy_use_per_person$country, total_fertility$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Energy vs. Fertility coverage",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Dominica |
| 2 | Marshall Islands |
| 3 | Palau |
| 4 | St. Kitts and Nevis |
### Life vs. Energy coverage
setdiff(life_expectancy_years$country, energy_use_per_person$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Life vs. Energy coverage",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Afghanistan |
| 2 | Andorra |
| 3 | Burundi |
| 4 | Burkina Faso |
| 5 | Central African Republic |
| 6 | Micronesia, Fed. Sts. |
| 7 | Guinea |
| 8 | Hong Kong, China |
| 9 | Lao |
| 10 | Liberia |
| 11 | Madagascar |
| 12 | Mali |
| 13 | Mauritania |
| 14 | Malawi |
| 15 | Papua New Guinea |
| 16 | Palestine |
| 17 | Rwanda |
| 18 | Sierra Leone |
| 19 | Somalia |
| 20 | Chad |
| 21 | Taiwan |
| 22 | Uganda |
### Energy vs. Life coverage
setdiff(energy_use_per_person$country,life_expectancy_years$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Energy vs. Life coverage",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Palau |
| 2 | St. Kitts and Nevis |
### Energy vs. Democracy coverage
setdiff(energy_use_per_person$country,demox_eiu$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Energy vs. Democracy coverage",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Antigua and Barbuda |
| 2 | Bahamas |
| 3 | Barbados |
| 4 | Belize |
| 5 | Brunei |
| 6 | Dominica |
| 7 | Georgia |
| 8 | Grenada |
| 9 | Kiribati |
| 10 | Maldives |
| 11 | Marshall Islands |
| 12 | Palau |
| 13 | Samoa |
| 14 | Sao Tome and Principe |
| 15 | Seychelles |
| 16 | Solomon Islands |
| 17 | South Sudan |
| 18 | St. Kitts and Nevis |
| 19 | St. Lucia |
| 20 | St. Vincent and the Grenadines |
| 21 | Tonga |
| 22 | Vanuatu |
### Democracy vs. Energy coverage
setdiff(demox_eiu$country,energy_use_per_person$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Democracy vs. Energy coverage",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Afghanistan |
| 2 | Burundi |
| 3 | Burkina Faso |
| 4 | Central African Republic |
| 5 | Guinea |
| 6 | Hong Kong, China |
| 7 | Lao |
| 8 | Liberia |
| 9 | Madagascar |
| 10 | Mali |
| 11 | Mauritania |
| 12 | Malawi |
| 13 | Papua New Guinea |
| 14 | Palestine |
| 15 | Rwanda |
| 16 | Sierra Leone |
| 17 | Chad |
| 18 | Taiwan |
| 19 | Uganda |
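Since every comparison above is run in both directions, the pattern can be wrapped in a small helper. This is a sketch only: `coverage_diff()` is a hypothetical function name, and the two vectors are toy stand-ins for `country` columns.

```r
# Hypothetical helper: report the coverage differences in both directions
coverage_diff <- function(a, b) {
  list(only_in_a = setdiff(a, b),
       only_in_b = setdiff(b, a))
}

# Toy vectors standing in for two `country` columns
set_a <- c("Aruba", "Angola", "Guam")
set_b <- c("Angola", "Andorra")

res <- coverage_diff(set_a, set_b)
res

# The gap in set sizes equals the gap between the two difference counts
length(unique(set_a)) - length(unique(set_b)) ==
  length(res$only_in_a) - length(res$only_in_b)
```

The same arithmetic explains the Fertility and Life comparison: 16 countries appear only in Total Fertility and 3 only in Life Expectancy, which accounts for the net difference of 13 (202 versus 189).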
Overall, we have a fairly complete vector of country values. Since the law of diminishing returns has set in for our Gapminder data set comparisons, let’s build a country name list to test against our map data coverage.
# Gapminder Country Name Reference DF -------------------------------------
country_names <- demox_eiu %>%
select(country) %>%
full_join(energy_use_per_person, by = "country") %>%
select(country) %>%
full_join(total_fertility, by = "country") %>%
select(country) %>%
full_join(life_expectancy_years, by = "country") %>%
select(country) %>%
arrange(country)
## Current working total
country_names$country %>%
n_distinct()
## [1] 207
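The same reference list can also be built as a plain set union of the `country` columns, which avoids joining the wide year columns only to drop them again. A sketch with toy tibbles in place of the four Gapminder sets; `country_names_alt` is a hypothetical name.

```r
library(dplyr)
library(purrr)

# Toy stand-ins for the four Gapminder tibbles (only `country` matters here)
toy_sets <- list(
  tibble::tibble(country = c("Sudan", "Chad")),
  tibble::tibble(country = c("Chad", "Palau")),
  tibble::tibble(country = c("Aruba", "Sudan"))
)

country_names_alt <- toy_sets %>%
  map("country") %>%      # pull the country vector out of each tibble
  reduce(union) %>%       # set union across all of them
  sort() %>%
  tibble::tibble(country = .)

country_names_alt
```

Adding a fifth or sixth data set then means adding one element to the list rather than another join and select pair.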
country_names %>% head(n = 10) %>%
knitr::kable(caption = "First Ten Country Designations",
row.names = TRUE)
| | country |
|---|---|
| 1 | Afghanistan |
| 2 | Albania |
| 3 | Algeria |
| 4 | Andorra |
| 5 | Angola |
| 6 | Antigua and Barbuda |
| 7 | Argentina |
| 8 | Armenia |
| 9 | Aruba |
| 10 | Australia |
country_names %>% tail(n = 10) %>%
knitr::kable(caption = "Last Ten Country Designations",
row.names = TRUE)
| | country |
|---|---|
| 1 | Uruguay |
| 2 | Uzbekistan |
| 3 | Vanuatu |
| 4 | Venezuela |
| 5 | Vietnam |
| 6 | Virgin Islands (U.S.) |
| 7 | Western Sahara |
| 8 | Yemen |
| 9 | Zambia |
| 10 | Zimbabwe |
So at this point we have 207 unique country level units of analysis. Please note that some country designations are better understood as regions within a nation-state, or as overseas territories belonging to a nation-state, rather than as distinct nation-states as recognized by the United Nations or the international community.
In the data set world_map, derived from ggplot2::map_data("world"), the region variable generally corresponds with the Gapminder country variable: but it can also define geographical rather than political entities. We need to dig into the map data subregion to obtain a proper match with country.
Let’s have a look.
world_map <- ggplot2::map_data("world")
world_map %>% vis_dat()
## Basic unit is region; subregion mostly NA
world_map %>% glimpse()
## Rows: 99,338
## Columns: 6
## $ long <dbl> -69.89912, -69.89571, -69.94219, -70.00415, -70.06612, -70.0~
## $ lat <dbl> 12.45200, 12.42300, 12.43853, 12.50049, 12.54697, 12.59707, ~
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 1~
## $ region <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba~
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## group or groups belong to regions
## order gives the sequence in which the long and lat points are drawn
## long == longitude lat == latitude
world_map %>%
skimr::skim(region, subregion)
| Name | Piped data |
| Number of rows | 99338 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| region | 0 | 1.00 | 2 | 35 | 0 | 252 | 0 |
| subregion | 63154 | 0.36 | 1 | 33 | 0 | 1069 | 0 |
## more regions than gapminder countries
## difference in emphasis
## subregion contains some units which gapminder treats as a country
The ggplot2 world map data uses region as the primary unit. A region can have subregions, but always consists of at least one group. The group marks out the polygon to be mapped and filled, by longitude (long) and latitude (lat) coordinates taken in the sequence given by order: some regions and even subregions require multiple groups to draw the appropriate shape. The subregion “Hong Kong”, for example, has three distinct groups: 668, 669, and 670.
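To see why group matters, consider a toy region made of two islands. This is a sketch with made-up coordinates and a made-up region name, not rows taken from the real map data.

```r
library(ggplot2)

# Toy "region" made of two separate islands, i.e. two groups (made-up coordinates)
islands <- data.frame(
  long   = c(0, 1, 1, 0,   3, 4, 4, 3),
  lat    = c(0, 0, 1, 1,   2, 2, 3, 3),
  group  = rep(c(1, 2), each = 4),
  order  = 1:8,
  region = "Toyland"
)

# With group mapped, each island is drawn as its own closed polygon
p_grouped <- ggplot(islands, aes(long, lat, group = group)) +
  geom_polygon(fill = "steelblue")

# Without it, all eight points are joined into one misshapen polygon
p_ungrouped <- ggplot(islands, aes(long, lat)) +
  geom_polygon(fill = "firebrick")
```

Printing the two plots side by side shows the difference: the grouped version gives two squares, while the ungrouped version draws a single shape stretched across both islands.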
In the majority of cases, we have either an existing or an obvious match between the Gapminder country and the map data region variables. But for that minority, we need to dig through the vectors. Below are some tools for that task.
### Check for example South Sudan
world_map %>%
filter(stringr::str_detect(region, "Sudan") ) %>%
distinct(region)
## region
## 1 Sudan
## 2 South Sudan
### Check for example South Sudan
world_map %>%
filter(stringr::str_detect(region, "South") ) %>%
distinct(region)
## region
## 1 French Southern and Antarctic Lands
## 2 South Korea
## 3 South Sudan
## 4 South Sandwich Islands
## 5 South Georgia
## 6 South Africa
### Check for example South Sudan
country_names %>%
filter(stringr::str_detect(country, "Sudan") )
## # A tibble: 2 x 1
## country
## <chr>
## 1 South Sudan
## 2 Sudan
### Check for example South Sudan
country_names %>%
filter(stringr::str_detect(country, "South") )
## # A tibble: 3 x 1
## country
## <chr>
## 1 South Africa
## 2 South Korea
## 3 South Sudan
### Check for example Hong Kong
world_map %>%
filter(stringr::str_detect(region, "Hong Kong") ) %>%
distinct(region) # NO!
## [1] region
## <0 rows> (or 0-length row.names)
### Check for example Hong Kong
world_map %>%
filter(stringr::str_detect(subregion, "Hong Kong") ) %>%
distinct(region, subregion) # YES!
## region subregion
## 1 China Hong Kong
### Group IDs for coordinates data
world_map %>%
filter(stringr::str_detect(subregion, "Hong Kong") ) %>%
select(group) %>%
distinct()
## group
## 1 668
## 2 669
## 3 670
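The separate region and subregion searches above can be folded into one helper. A sketch only: `find_place()` is a hypothetical name, and the toy tibble merely mimics the region and subregion columns of the map data.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: search region and subregion in one pass
find_place <- function(map_df, pattern) {
  map_df %>%
    filter(str_detect(region, pattern) |
             str_detect(coalesce(subregion, ""), pattern)) %>%
    distinct(region, subregion)
}

# Toy rows mimicking the structure of map_data("world")
toy_world <- tibble::tibble(
  region    = c("China", "China", "South Sudan", "Sudan"),
  subregion = c("Hong Kong", NA, NA, NA)
)

find_place(toy_world, "Hong Kong")
find_place(toy_world, "Sudan")
```

The `coalesce()` call turns missing subregions into empty strings, so a row with no subregion is still kept or dropped on the strength of its region alone.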
Let’s identify the mismatches and work to correct as many as possible.
####Identify key differences --------------------------------------
map_vs_gap <- setdiff(world_map$region, country_names$country) %>%
enframe(name = NULL, value = "desn") %>%
arrange(desn)
gap_vs_map <- setdiff(country_names$country, world_map$region) %>%
enframe(name = NULL, value = "desn") %>%
arrange(desn)
map_vs_gap %>%
knitr::kable(caption = "Map regions vs. Gap countries: Coverage diff",
row.names = TRUE)
| | desn |
|---|---|
| 1 | American Samoa |
| 2 | Anguilla |
| 3 | Antarctica |
| 4 | Antigua |
| 5 | Ascension Island |
| 6 | Azores |
| 7 | Barbuda |
| 8 | Bermuda |
| 9 | Bonaire |
| 10 | Canary Islands |
| 11 | Cayman Islands |
| 12 | Chagos Archipelago |
| 13 | Christmas Island |
| 14 | Cocos Islands |
| 15 | Cook Islands |
| 16 | Curacao |
| 17 | Democratic Republic of the Congo |
| 18 | Falkland Islands |
| 19 | Faroe Islands |
| 20 | French Southern and Antarctic Lands |
| 21 | Grenadines |
| 22 | Guernsey |
| 23 | Heard Island |
| 24 | Isle of Man |
| 25 | Ivory Coast |
| 26 | Jersey |
| 27 | Kosovo |
| 28 | Kyrgyzstan |
| 29 | Laos |
| 30 | Liechtenstein |
| 31 | Macedonia |
| 32 | Madeira Islands |
| 33 | Micronesia |
| 34 | Monaco |
| 35 | Montserrat |
| 36 | Nauru |
| 37 | Nevis |
| 38 | Niue |
| 39 | Norfolk Island |
| 40 | Northern Mariana Islands |
| 41 | Pitcairn Islands |
| 42 | Republic of Congo |
| 43 | Saba |
| 44 | Saint Barthelemy |
| 45 | Saint Helena |
| 46 | Saint Kitts |
| 47 | Saint Lucia |
| 48 | Saint Martin |
| 49 | Saint Pierre and Miquelon |
| 50 | Saint Vincent |
| 51 | San Marino |
| 52 | Siachen Glacier |
| 53 | Sint Eustatius |
| 54 | Sint Maarten |
| 55 | Slovakia |
| 56 | South Georgia |
| 57 | South Sandwich Islands |
| 58 | Swaziland |
| 59 | Tobago |
| 60 | Trinidad |
| 61 | Turks and Caicos Islands |
| 62 | UK |
| 63 | USA |
| 64 | Vatican |
| 65 | Virgin Islands |
| 66 | Wallis and Futuna |
gap_vs_map %>%
knitr::kable(caption = "Gap countries vs. Map regions: Coverage diff",
row.names = TRUE)
| | desn |
|---|---|
| 1 | Antigua and Barbuda |
| 2 | Channel Islands |
| 3 | Congo, Dem. Rep. |
| 4 | Congo, Rep. |
| 5 | Cote d’Ivoire |
| 6 | Eswatini |
| 7 | Hong Kong, China |
| 8 | Kyrgyz Republic |
| 9 | Lao |
| 10 | Macao, China |
| 11 | Micronesia, Fed. Sts. |
| 12 | Netherlands Antilles |
| 13 | North Macedonia |
| 14 | Slovak Republic |
| 15 | St. Kitts and Nevis |
| 16 | St. Lucia |
| 17 | St. Vincent and the Grenadines |
| 18 | Trinidad and Tobago |
| 19 | United Kingdom |
| 20 | United States |
| 21 | Virgin Islands (U.S.) |
When going from the map region variable to Gapminder country values, we find 66 differences. Some of these are geographical entities or national subregions or overseas territories that we would not expect to find considered in the Gapminder data. Others are simple mismatches easily reconciled. Another group is a bit more tricky for coding, but logically straightforward. For example, the country is “Trinidad and Tobago”: the two primary geographical entities are islands “Trinidad” and “Tobago”, both region values in the map data.
When going from the Gapminder country to the map region values, our primary concern for reconciliation, we find 21 differences. These break down into four rough categories: Easy Cases, Island Nations, Subregion Promotion, and Do Not Restore.
The first, the easy cases, is rather straightforward.
### Easy cases -- see tools above for digging out names
world_map2 <- world_map %>%
rename(country = region) %>%
mutate(country = case_when(country == "Macedonia" ~ "North Macedonia" ,
country == "Ivory Coast" ~ "Cote d'Ivoire",
country == "Democratic Republic of the Congo" ~ "Congo, Dem. Rep.",
country == "Republic of Congo" ~ "Congo, Rep.",
country == "UK" ~ "United Kingdom",
country == "USA" ~ "United States",
country == "Laos" ~ "Lao",
country == "Slovakia" ~ "Slovak Republic",
country == "Saint Lucia" ~ "St. Lucia",
country == "Kyrgyzstan" ~ "Kyrgyz Republic",
country == "Micronesia" ~ "Micronesia, Fed. Sts.",
country == "Swaziland" ~ "Eswatini",
country == "Virgin Islands" ~ "Virgin Islands (U.S.)",
TRUE ~ country))
### Progress check
setdiff(country_names$country, world_map2$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Remaining Cases",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Antigua and Barbuda |
| 2 | Channel Islands |
| 3 | Hong Kong, China |
| 4 | Macao, China |
| 5 | Netherlands Antilles |
| 6 | St. Kitts and Nevis |
| 7 | St. Vincent and the Grenadines |
| 8 | Trinidad and Tobago |
We have eight remaining cases.
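As an aside, the easy-case renaming could also be driven by a named lookup vector instead of a long case_when(), which keeps the name pairs in one editable place. A sketch with a subset of the pairs used above; `name_fix`, `toy_map`, and `toy_fixed` are hypothetical names.

```r
library(dplyr)

# Map name -> Gapminder name (a subset of the pairs used above)
name_fix <- c("UK"        = "United Kingdom",
              "USA"       = "United States",
              "Laos"      = "Lao",
              "Swaziland" = "Eswatini")

toy_map <- tibble::tibble(country = c("UK", "Laos", "France"))

# Look each name up; names with no entry come back NA and keep their original value
toy_fixed <- toy_map %>%
  mutate(country = coalesce(unname(name_fix[country]), country))

toy_fixed
```

The `coalesce()` fallback plays the role of the `TRUE ~ country` branch: anything not listed in the lookup passes through unchanged.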
The cases of Antigua and Barbuda, St. Kitts and Nevis, Trinidad and Tobago, and St. Vincent and the Grenadines are all similar: combine the related map region designations to the appropriate new country designation. In each instance, this will require creating new group values while retaining the order, long and lat values.
## Get data for Island nations
match_names <- c("Antigua" , "Barbuda", "Nevis",
"Saint Kitts", "Trinidad" ,
"Tobago", "Grenadines" , "Saint Vincent")
### Island nations data set
map_match <- world_map2 %>%
filter(country %in% match_names)
map_match %>% distinct(country)
## country
## 1 Antigua
## 2 Barbuda
## 3 Nevis
## 4 Saint Kitts
## 5 Trinidad
## 6 Tobago
## 7 Grenadines
## 8 Saint Vincent
### Group IDs for the countries
ant_bar <- c(137 ,138 )
kit_nev <- c(930 , 931)
tri_tog <- c(1425, 1426)
vin_gre <- c(1575, 1576, 1577)
# chan_isl <- c(594, 861)
# neth_ant <- c(1055, 1056)
### assign new group ID
map_match <- map_match %>%
mutate(group = case_when(group == 137 ~ 2001,
group == 138 ~ 2002,
group == 930 ~ 2003,
group == 931 ~ 2004,
group == 1425 ~ 2005,
group == 1426 ~ 2006,
group == 1575 ~ 2007,
group == 1576 ~ 2008,
group == 1577 ~ 2009) )
map_match %>%
distinct(group)
## group
## 1 2001
## 2 2002
## 3 2003
## 4 2004
## 5 2005
## 6 2006
## 7 2007
## 8 2008
## 9 2009
ant_barn <- c(2001 ,2002 )
kit_nevn <- c(2003, 2004)
tri_togn <- c(2005, 2006)
vin_gren <- c(2007, 2008, 2009)
new_names_ref <- c("Antigua and Barbuda", "St. Kitts and Nevis",
"Trinidad and Tobago", "St. Vincent and the Grenadines")
### assign new country names to match Gapminder
map_match <- map_match %>%
mutate(country = case_when(group %in% ant_barn ~ "Antigua and Barbuda" ,
group %in% kit_nevn ~ "St. Kitts and Nevis" ,
group %in% tri_togn ~ "Trinidad and Tobago" ,
group %in% vin_gren ~ "St. Vincent and the Grenadines")
) %>%
tibble()
### Quick checks
map_match %>% head()
## # A tibble: 6 x 6
## long lat group order country subregion
## <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 -61.7 17.0 2001 7243 Antigua and Barbuda <NA>
## 2 -61.7 17.0 2001 7244 Antigua and Barbuda <NA>
## 3 -61.9 17.0 2001 7245 Antigua and Barbuda <NA>
## 4 -61.9 17.1 2001 7246 Antigua and Barbuda <NA>
## 5 -61.9 17.1 2001 7247 Antigua and Barbuda <NA>
## 6 -61.8 17.2 2001 7248 Antigua and Barbuda <NA>
map_match %>%
distinct(country)%>%
knitr::kable(caption = "Add to World Map")
| country |
|---|
| Antigua and Barbuda |
| St. Kitts and Nevis |
| Trinidad and Tobago |
| St. Vincent and the Grenadines |
map_match %>%
group_by(country) %>%
count(group) %>%
knitr::kable(caption = "Add to World Map")
| country | group | n |
|---|---|---|
| Antigua and Barbuda | 2001 | 12 |
| Antigua and Barbuda | 2002 | 10 |
| St. Kitts and Nevis | 2003 | 7 |
| St. Kitts and Nevis | 2004 | 13 |
| St. Vincent and the Grenadines | 2007 | 16 |
| St. Vincent and the Grenadines | 2008 | 23 |
| St. Vincent and the Grenadines | 2009 | 10 |
| Trinidad and Tobago | 2005 | 30 |
| Trinidad and Tobago | 2006 | 8 |
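The new group IDs above were assigned by hand. A programmatic alternative is sketched below, assuming only that the new IDs must not collide with existing ones: the `offset` value and the toy rows are illustrative, and the resulting numbering is sequential by first appearance rather than identical to the hand-assigned values.

```r
library(dplyr)

# Toy rows standing in for the island data (group IDs as in the text)
toy_islands <- tibble::tibble(
  country = c("Antigua", "Antigua", "Barbuda", "Trinidad", "Tobago"),
  group   = c(137, 137, 138, 1425, 1426)
)

# Hypothetical offset: any value above the largest group ID already in use
offset <- 2000

# Number the distinct old IDs 1, 2, 3, ... in order of appearance, then shift
toy_regrouped <- toy_islands %>%
  mutate(group = offset + match(group, unique(group)))

toy_regrouped
```

One practical difference: a case_when() with no fallback branch returns NA for any group it does not list, whereas this approach renumbers every group present in the subset.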
#### Structure check for merge
map_match %>%
str()
## tibble [129 x 6] (S3: tbl_df/tbl/data.frame)
## $ long : num [1:129] -61.7 -61.7 -61.9 -61.9 -61.9 ...
## $ lat : num [1:129] 17 17 17 17.1 17.1 ...
## $ group : num [1:129] 2001 2001 2001 2001 2001 ...
## $ order : int [1:129] 7243 7244 7245 7246 7247 7248 7249 7250 7251 7252 ...
## $ country : chr [1:129] "Antigua and Barbuda" "Antigua and Barbuda" "Antigua and Barbuda" "Antigua and Barbuda" ...
## $ subregion: chr [1:129] NA NA NA NA ...
world_map2 %>%
str()
## 'data.frame': 99338 obs. of 6 variables:
## $ long : num -69.9 -69.9 -69.9 -70 -70.1 ...
## $ lat : num 12.5 12.4 12.4 12.5 12.5 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ country : chr "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ subregion: chr NA NA NA NA ...
#### Time to Slice, Dice, and Restack
world_map2 <- world_map2 %>%
filter(!country %in% match_names)
world_map2 <- world_map2 %>%
bind_rows(map_match) %>%
arrange(country) %>%
tibble()
### Safety check -- should return empty set
world_map2 %>%
filter(country %in% match_names)
## # A tibble: 0 x 6
## # ... with 6 variables: long <dbl>, lat <dbl>, group <dbl>, order <int>,
## # country <chr>, subregion <chr>
### Safety check - should return one complete row each
world_map2 %>%
filter(country %in% new_names_ref) %>%
group_by(country) %>%
slice_max(order, n = 1)
## # A tibble: 4 x 6
## # Groups: country [4]
## long lat group order country subregion
## <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 -61.7 17.6 2002 7265 Antigua and Barbuda <NA>
## 2 -62.6 17.2 2004 58081 St. Kitts and Nevis <NA>
## 3 -61.2 13.2 2009 98189 St. Vincent and the Grenadines <NA>
## 4 -60.8 11.2 2006 89453 Trinidad and Tobago <NA>
The cases of Macao, China, and Hong Kong, China differ again: in the map data set, each is a subregion of the region China. But economic and public health data for both former city-states, now Special Administrative Regions in China, has for decades been, and continues to be, treated separately from that of mainland China (PRC). Each, for the purposes of Global Studies, has country-level status (which is not the same as nation-state status). So we should follow practice and treat them as country-level entities in terms of the map data set.
####
### Hong Kong and Macao
#### Pull from subregion; slice out; restack
sub_sleeps <- c("Hong Kong", "Macao")
hk_mc <- world_map2 %>%
filter(subregion %in% sub_sleeps)
hk_mc <- hk_mc %>%
mutate(country = case_when(subregion == "Hong Kong" ~ "Hong Kong, China" ,
subregion == "Macao" ~ "Macao, China" ) ) %>%
mutate(group = case_when(group == 668 ~ 2010,
group == 669 ~ 2011,
group == 670 ~ 2012,
group == 960 ~ 2013))
### Safety check for bind_rows()
hk_mc %>%
slice(38:41) %>%
knitr::kable(caption = "Check structure")
| long | lat | group | order | country | subregion |
|---|---|---|---|---|---|
| 114.0067 | 22.48403 | 2012 | 45801 | Hong Kong, China | Hong Kong |
| 114.0154 | 22.51191 | 2012 | 45802 | Hong Kong, China | Hong Kong |
| 113.4789 | 22.19556 | 2013 | 59893 | Macao, China | Macao |
| 113.4810 | 22.21748 | 2013 | 59894 | Macao, China | Macao |
### Slice out old info
world_map2 <- world_map2 %>%
filter(!subregion %in% sub_sleeps)
### Stack in new info
world_map2 <- world_map2 %>%
bind_rows(hk_mc) %>%
select(-subregion) %>%
tibble()
### Progress check
setdiff(country_names$country, world_map2$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Remaining Cases",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Channel Islands |
| 2 | Netherlands Antilles |
Finally, we have some cases we arguably should not reconcile. The Netherlands Antilles was dissolved in 2010. It consisted of the islands Curaçao, Bonaire, Aruba (until 1986), Saba, Sint Eustatius, and Sint Maarten. Aruba, which has a country designation in the Gapminder data, is “a constituent country of the Kingdom of the Netherlands”; Curaçao and Sint Maarten, likewise. Each has its own ISO code. Bonaire, Saba, and Sint Eustatius are special municipalities within the country of the Netherlands: all share the same ISO code. Recombining these various constituent countries and special municipalities back into the historical Netherlands Antilles, itself once a constituent country of the Kingdom of the Netherlands, would come at the cost of current (since 2010) and future compatibility with data collection and analysis.
Likewise, we should pass on restoring the historical designation the Channel Islands. The Channel Islands primarily consist of the Bailiwick of Guernsey and the Bailiwick of Jersey. The two main islands, Guernsey and Jersey, make up 99% of the population and 92% of the land area. So we could combine these from the map region to make the “Channel Islands” country designation: but the case against doing so is stronger. First, both Guernsey and Jersey have their own ISO codes; in contrast, the Channel Islands ISO code is now defunct. Second, perhaps more to the point, as Wikipedia correctly records: “‘Channel Islands’ is a geographical term, not a political unit. The two bailiwicks have been administered separately since the late 13th century. Each has its own independent laws, elections, and representative bodies… Any institution common to both is the exception rather than the rule.”
world_map2 %>% distinct(country) %>%
DT::datatable(caption = "Map Country List")
### No Tuvalu in map -- add coordinates
world_map2 %>%
filter(stringr::str_detect(country, "Tu") ) %>%
distinct(country)
## # A tibble: 4 x 1
## country
## <chr>
## 1 Tunisia
## 2 Turkey
## 3 Turkmenistan
## 4 Turks and Caicos Islands
We now have a new problem. Our map data lacks coordinates (indeed, entries) for countries or subregions which have ISO codes: the nation Tuvalu, for example, and the territories of Gibraltar and the British Virgin Islands. None of these currently shows in our list of countries for the map data. For these three cases (but regretfully, not for all) we can download the polygon coordinates from OpenDataSoft. Some hacking around (not on display here) will get us compatible data sets.
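For readers curious what that hacking around might involve, here is a minimal sketch, not the code actually used for this project. It assumes the downloaded polygons have already been flattened to one row per vertex; the names `raw_coords`, `poly_id`, and `existing_map` are hypothetical stand-ins. The idea is to rename the columns to the `long`, `lat`, `group`, `order`, `country` layout and to offset the `group` and `order` ids so they cannot clash with ids already in the map.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the map we already have (hypothetical values).
existing_map <- tibble(
  long    = c(0, 1, 1),
  lat     = c(0, 0, 1),
  group   = c(1, 1, 1),
  order   = c(1, 2, 3),
  country = "Toyland"
)

# Hypothetical downloaded vertices: one row per point, one id per polygon.
raw_coords <- tribble(
  ~poly_id, ~lon,   ~lat,
  1,        179.20, -8.55,
  1,        179.21, -8.56,
  1,        179.22, -8.54,
  2,        179.10, -8.47,
  2,        179.11, -8.48,
  2,        179.12, -8.46
)

# Start the new ids just above the largest ids already in use.
group_offset <- max(existing_map$group)
order_offset <- max(existing_map$order)

new_coords <- raw_coords %>%
  transmute(
    long    = lon,
    lat     = lat,
    group   = poly_id + group_offset,       # one group per polygon
    order   = row_number() + order_offset,  # drawing order within the file
    country = "Tuvalu"
  )

new_coords
```

The result has the same column names as the map data, so it can be appended with `bind_rows()`.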
### let's get a count so far
#world_map2$country %>% n_distinct()
### From https://public.opendatasoft.com/
tuvalu_coords <- readRDS(here::here("data",
"tidy_data",
"tuvalu_coords.rds") )
tuvalu_coords %>% head() ## check structure
## # A tibble: 6 x 5
## long lat group order country
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 179. -8.55 2010 110000 Tuvalu
## 2 179. -8.56 2010 110001 Tuvalu
## 3 179. -8.47 2010 110003 Tuvalu
## 4 179. -8.48 2010 110004 Tuvalu
## 5 179. -8.49 2010 110005 Tuvalu
## 6 179. -8.50 2011 110006 Tuvalu
## Add to map
world_map2 <- world_map2 %>%
bind_rows(tuvalu_coords) %>%
arrange(country)
## Check!
world_map2 %>%
filter(stringr::str_detect(country, "Tu") ) %>%
distinct(country)
## # A tibble: 5 x 1
## country
## <chr>
## 1 Tunisia
## 2 Turkey
## 3 Turkmenistan
## 4 Turks and Caicos Islands
## 5 Tuvalu
We’ve successfully added Tuvalu. Now, for Gibraltar and the British Virgin Islands.
### Missing also Gibraltar & Virgin Islands (British)
### From https://public.opendatasoft.com/
Gib_BVI_coords <- readRDS(file = here::here("data",
"tidy_data",
"Gib_BVI_coords.rds"))
Gib_BVI_coords %>% head()
## # A tibble: 6 x 5
## long lat group order country
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -5.36 36.1 2014 110035 Gibraltar
## 2 -5.34 36.1 2014 110036 Gibraltar
## 3 -5.34 36.1 2014 110037 Gibraltar
## 4 -5.35 36.1 2014 110038 Gibraltar
## 5 -5.36 36.1 2014 110039 Gibraltar
## 6 -64.6 18.3 2021 110045 Virgin Islands (British)
world_map2 <- world_map2 %>%
bind_rows(Gib_BVI_coords) %>%
arrange(country)
world_map2 %>%
filter(stringr::str_detect(country, "Gib") ) %>%
distinct(country)
## # A tibble: 1 x 1
## country
## <chr>
## 1 Gibraltar
world_map2 %>%
filter(stringr::str_detect(country, "Vir") ) %>%
distinct(country)
## # A tibble: 2 x 1
## country
## <chr>
## 1 Virgin Islands (British)
## 2 Virgin Islands (U.S.)
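One thing worth checking when appending polygons from another source is that the new `group` ids do not collide with ids already in the map, since `geom_polygon()` would otherwise connect unrelated shapes. The following is a sketch on toy data; `base_map` and `new_region` are hypothetical stand-ins for `world_map2` and a downloaded coordinate set.

```r
library(dplyr)
library(tibble)

# Toy stand-ins (hypothetical values), in the same five-column layout.
base_map <- tibble(
  long    = c(0, 1, 1, 5, 6, 6),
  lat     = c(0, 0, 1, 5, 5, 6),
  group   = c(1, 1, 1, 2, 2, 2),
  order   = 1:6,
  country = rep(c("A", "B"), each = 3)
)

new_region <- tibble(
  long    = c(9, 10, 10),
  lat     = c(9, 9, 10),
  group   = c(3, 3, 3),
  order   = 7:9,
  country = "C"
)

# Group ids present in both data sets; ideally there are none.
shared_groups <- intersect(base_map$group, new_region$group)

combined <- bind_rows(base_map, new_region) %>%
  arrange(country, order)

# After binding, every group id should belong to exactly one country.
groups_ok <- combined %>%
  distinct(country, group) %>%
  count(group, name = "n_countries") %>%
  summarise(ok = all(n_countries == 1)) %>%
  pull(ok)

shared_groups
groups_ok
```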
We now have a map which provides near-complete coverage of the Gapminder.org data sets, and will work for other Global Studies data sets. We need now to add the ISO 3166-1 Country Codes to our map data: in particular, the Alpha-2 code, the Alpha-3 code, and the Numeric code. This will ensure compatibility with a greater range of Global Studies data sets.
Please note that the country_ISO_codes data set below was compiled and cross-checked using various open sources. But as ISO 3166-1 is a moving target (an ongoing process), this data set will need checking and updating.
country_ISO_codes <- readRDS(file = here::here("data",
"tidy_data",
"country_ISO_codes2.rds") )
country_ISO_codes %>% head()
## # A tibble: 6 x 5
## s_name code_2 code_3 code_num form_name
## <chr> <chr> <chr> <dbl> <chr>
## 1 Afghanistan AF AFG 4 Islamic Republic of Afghanistan
## 2 Aland Islands AX ALA 248 Åland
## 3 Albania AL ALB 8 The Republic of Albania
## 4 Algeria DZ DZA 12 The People's Democratic Republic of Alg~
## 5 American Samoa AS ASM 16 The Territory of American Samoa
## 6 Andorra AD AND 20 The Principality of Andorra
It turns out, however, that like our map data, our master list of ISO Country Codes was also not complete. Finding a free and reliable Open Source version is not easy – and I do not have access to the commercial version. So, below, how to update country_ISO_codes.
### Missing Norfolk Island
norfolk_codes <- tibble(s_name = "Norfolk Island",
code_2 = "NF",
code_3 = "NFK",
code_num = 574,
form_name = "Territory of Norfolk Island, Australia")
norfolk_codes %>% head()
## # A tibble: 1 x 5
## s_name code_2 code_3 code_num form_name
## <chr> <chr> <chr> <dbl> <chr>
## 1 Norfolk Island NF NFK 574 Territory of Norfolk Island, Australia
country_ISO_codes2 <- country_ISO_codes %>%
bind_rows(norfolk_codes) %>%
arrange(s_name)
country_ISO_codes2 %>%
filter(code_2 == "NF") %>%
slice_head(n = 1)
## # A tibble: 1 x 5
## s_name code_2 code_3 code_num form_name
## <chr> <chr> <chr> <dbl> <chr>
## 1 Norfolk Island NF NFK 574 Territory of Norfolk Island, Australia
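After patching a codes table by hand, a few integrity checks can catch duplicates or malformed entries. This is a sketch only: `check_codes()` is a hypothetical helper, not part of the project, and it assumes the column names shown above. The checks test the shape of the codes, not whether they are the officially assigned ones.

```r
library(dplyr)
library(tibble)
library(stringr)

# Small sample in the same layout as country_ISO_codes2.
codes <- tribble(
  ~s_name,          ~code_2, ~code_3, ~code_num,
  "Afghanistan",    "AF",    "AFG",   4,
  "Albania",        "AL",    "ALB",   8,
  "Norfolk Island", "NF",    "NFK",   574
)

# Hypothetical helper: returns one row per check with a pass/fail flag.
check_codes <- function(codes) {
  tibble(
    check = c("no missing alpha-2", "unique names", "unique alpha-2",
              "unique alpha-3", "unique numeric", "alpha-2 format",
              "alpha-3 format", "numeric in 1-999"),
    passed = c(
      # Namibia's alpha-2 code is "NA", which CSV readers may parse as missing.
      !anyNA(codes$code_2),
      anyDuplicated(codes$s_name) == 0,
      anyDuplicated(codes$code_2) == 0,
      anyDuplicated(codes$code_3) == 0,
      anyDuplicated(codes$code_num) == 0,
      isTRUE(all(str_detect(codes$code_2, "^[A-Z]{2}$"))),
      isTRUE(all(str_detect(codes$code_3, "^[A-Z]{3}$"))),
      isTRUE(all(codes$code_num >= 1 & codes$code_num <= 999))
    )
  )
}

check_codes(codes)
```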
Now that we have our ISO Country Codes loaded and updated, we are almost ready to add them to the map data. One set of checks for differences.
### Remaining Gapminder cases -- the two historical entities
setdiff(country_names$country, world_map2$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Gap vs Map: Remaining Cases",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Channel Islands |
| 2 | Netherlands Antilles |
setdiff(country_ISO_codes2$s_name , world_map2$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "ISO vs Map: Remaining Cases",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Aland Islands |
| 2 | Bouvet Island |
| 3 | British Indian Ocean Territory |
| 4 | Svalbard and Jan Mayen |
| 5 | Tokelau |
| 6 | United States Minor Outlying Islands |
setdiff(world_map2$country, country_ISO_codes2$s_name) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Map vs. ISO: Remaining Cases",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Siachen Glacier |
So our checks indicate success. First, we declined to restore the two defunct designations, one of which reflected an historical country-level entity, and the other, a geographically convenient label. Our map data set by decision will not account for the Netherlands Antilles and the Channel Islands.
Second, of the ISO vs Map cases, only the sparsely populated Tokelau possibly matters, but OpenDataSoft does not have the polygon coordinates for it. The Chagos Archipelago, included in our map, makes up the most important part of the British Indian Ocean Territory. The remaining four cases comprise seasonally inhabited regions and/or remote military bases. These produce negligible data in terms of economic or public health statistics, and can be safely ignored for those purposes.
Third and finally, the original map makers included the Siachen Glacier. This is a geographical entity and a disputed territory: but it does not have an individual ISO code, does not have civilian residents, and does not produce the relevant sort of data. So it remains excluded.
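The pairs of `setdiff()` calls above can also be wrapped into one two-way comparison. The sketch below uses a hypothetical helper, `compare_names()`, and toy name vectors rather than the project data.

```r
library(dplyr)
library(tibble)

# Toy name vectors standing in for world_map2$country and
# country_ISO_codes2$s_name.
map_names <- c("Afghanistan", "Albania", "Siachen Glacier")
iso_names <- c("Afghanistan", "Albania", "Tokelau")

# Hypothetical helper: list names found in only one of the two vectors,
# labelled by the side they come from.
compare_names <- function(x, y, x_label = "x", y_label = "y") {
  bind_rows(
    tibble(name = setdiff(x, y), only_in = x_label),
    tibble(name = setdiff(y, x), only_in = y_label)
  ) %>%
    arrange(only_in, name)
}

compare_names(map_names, iso_names, x_label = "map", y_label = "ISO")
```

A single table of this kind makes it easier to see at a glance which side each unmatched name comes from.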
world_map2_ISO <- world_map2 %>%
left_join(country_ISO_codes2, by = c("country" = "s_name")) %>%
tibble()
world_map2_ISO %>% vis_dat()
world_map2_ISO %>% glimpse()
## Rows: 99,442
## Columns: 9
## $ long <dbl> 74.89131, 74.84023, 74.76738, 74.73896, 74.72666, 74.66895, ~
## $ lat <dbl> 37.23164, 37.22505, 37.24917, 37.28564, 37.29072, 37.26670, ~
## $ group <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ order <dbl> 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, ~
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
## $ code_2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", ~
## $ code_3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG~
## $ code_num <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ~
## $ form_name <chr> "Islamic Republic of Afghanistan", "Islamic Republic of Afgh~
We now have mapping data with the following standard ISO country codes: Alpha-2 code, Alpha-3 code, and Numeric code. We can match by country name to the majority of Gapminder.org data sets, and we can match by code to any Global Studies data set which likewise uses one or more of the above ISO country codes. So the data set world_map2_ISO offers an update on ggplot2::map_data("world") with improved interoperability.
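As an illustration of matching by code, here is a minimal sketch of a Choropleth built from a join on the Alpha-3 code. Everything in it is a toy: `toy_map` stands in for `world_map2_ISO`, `toy_stats` for an external data set, and the codes "AAA" and "BBB" and the `life_exp` values are made up.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Toy stand-in for world_map2_ISO: two square "countries".
toy_map <- tibble(
  long    = c(0, 1, 1, 0, 2, 3, 3, 2),
  lat     = c(0, 0, 1, 1, 0, 0, 1, 1),
  group   = rep(c(1, 2), each = 4),
  order   = 1:8,
  country = rep(c("Country A", "Country B"), each = 4),
  code_3  = rep(c("AAA", "BBB"), each = 4)   # made-up codes
)

# Toy external data set keyed by an Alpha-3 column with a different name.
toy_stats <- tibble(
  iso3     = c("AAA", "BBB"),
  life_exp = c(71.2, 80.5)                   # made-up values
)

# Attach the statistic to every polygon vertex via the shared code.
map_with_stats <- toy_map %>%
  left_join(toy_stats, by = c("code_3" = "iso3"))

# Fill each polygon by the joined value.
choropleth <- ggplot(map_with_stats,
                     aes(x = long, y = lat, group = group, fill = life_exp)) +
  geom_polygon(colour = "grey30") +
  coord_quickmap() +
  theme_void()

choropleth
```

Joining on a code rather than a name avoids the spelling mismatches that the earlier sections had to reconcile by hand.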
save_data <- c("world_map2_ISO",
"country_ISO_codes2")
# Save Data! --------------------------
save(list = save_data, file = here::here("data",
"tidy_data",
"maps",
"world_map2_project.rda" ))
Thomas J. Haslam
2021-08-02
github.com/Thom-J-H/map_Gap_2_Tidy